4. Main Analysis (Exploratory Data Analysis)
The topic of interest is the NYC subway, specifically in Manhattan. We wanted to answer several questions about the subway itself: which time periods were the busiest, which stations were the busiest, and how human traffic patterns differed between weekdays and weekends. We also wanted to explore related questions, e.g., which stations had the highest crime counts, whether weather (rainfall) affected subway travel patterns, and whether taxi and subway usage were positively or negatively correlated.
Therefore, we have four main groups of data: NYC subway turnstile counter readings, crime data, weather data and taxi data.
We start by exploring subway turnstile data on its own, before we move on to looking at its relationships with other variables that we are interested in (i.e., crime, weather, taxi).
4.1 Static illustration of turnstile data
library(tidyverse)
# read data
turnstile = read.csv("data/2015_manhattan_turnstile_usage.csv")
4.1.1 Average by day of week
# GroupBy 1.day & 2.interval --> average entry & exit volume
data1 <- turnstile %>% select(interval, day, entry_volume, exit_volume) %>% group_by(day, interval) %>% summarise(avg_entry = mean(entry_volume), avg_exit = mean(exit_volume))
# Reorder by day & interval
data1$day <- factor(data1$day, c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
data1$interval <- factor(data1$interval, c("08PM-12AM","04PM-08PM","12PM-04PM","08AM-12PM","04AM-08AM","12AM-04AM"))
ggplot(data1, aes(y = avg_entry, x = interval)) +
geom_col(col='#0072B2', fill="#66CC99") + ylab("Entry Count") + xlab("Interval") + facet_wrap(~ day) + coord_flip()
ggplot(data1, aes(y = avg_exit, x = interval)) +
geom_col(col='#0072B2', fill='#E69F00') + ylab("Exit Count") + xlab("Interval") + facet_wrap(~ day) + coord_flip()
- First, we look at how the average number of entries and exits changes over the course of a day, faceted by day of week. The variables “day” and “interval” are releveled in time order so that the trend over time is easy to follow.
- The main feature of these plots is that the peak time for entries is between “4pm - 8pm” and the peak time for exits is between “8am - 12pm”.
- This suggests that people come into Manhattan from outside between “8am - 12pm”, the peak time for exits, and that many people leave Manhattan between “4pm - 8pm”, the peak time for entries.
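The peak intervals read off the plots can also be extracted programmatically. The following is a minimal sketch, assuming a data frame shaped like `data1` (columns `day`, `interval`, `avg_entry`, `avg_exit`); the `toy` values are made up purely for illustration and are not taken from the turnstile data.

```r
library(dplyr)

# Toy stand-in for data1; the numbers are invented for illustration only
toy <- data.frame(
  day       = rep(c("Monday", "Sunday"), each = 3),
  interval  = rep(c("08AM-12PM", "12PM-04PM", "04PM-08PM"), times = 2),
  avg_entry = c(900, 1000, 1600, 400, 700, 650),
  avg_exit  = c(1500, 1100, 1200, 450, 720, 600),
  stringsAsFactors = FALSE
)

# Interval with the highest average entry count, per day
peak_entry <- toy %>%
  group_by(day) %>%
  slice_max(avg_entry, n = 1) %>%
  ungroup() %>%
  select(day, peak_entry_interval = interval)

# Interval with the highest average exit count, per day
peak_exit <- toy %>%
  group_by(day) %>%
  slice_max(avg_exit, n = 1) %>%
  ungroup() %>%
  select(day, peak_exit_interval = interval)

# One row per day with both peak intervals side by side
peaks <- inner_join(peak_entry, peak_exit, by = "day")
print(peaks)
```

Applied to `data1` instead of `toy`, this would give one row per day of week, which makes it easy to check whether the peak intervals are the same on every day.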
4.1.2 The effect of weekday & weekend
- In the second part of the static analysis of subway data, we explore how subway usage differs between weekdays and weekends.
- In 2-1, we look at the general pattern of subway entries and exits by time of day, pooled across the whole week. Then, in 2-2, we compare weekdays with weekends.
4.1.2-1 Average by all
# GroupBy interval --> average entry & exit volume
data2_1 <- turnstile %>% select(interval, entry_volume, exit_volume) %>% group_by(interval) %>% summarise(avg_entry = mean(entry_volume), avg_exit = mean(exit_volume))
# Reorder interval
data2_1$interval <- factor(data2_1$interval, c("08PM-12AM","04PM-08PM","12PM-04PM","08AM-12PM","04AM-08AM","12AM-04AM"))
ggplot(data2_1, aes(y = avg_entry, x = interval)) +
geom_col(col='#0072B2', fill="#66CC99") + ylab("Entry Count") + xlab("Interval") + coord_flip()
ggplot(data2_1, aes(y = avg_exit, x = interval)) +
geom_col(col='#0072B2', fill='#E69F00') + ylab("Exit Count") + xlab("Interval") + coord_flip()
- To see the overall pattern of subway usage, we grouped the data by interval, ignoring day of week, and averaged the entry and exit counts within each interval. Again, to show how entries and exits change over time, we releveled the variable “interval” in time order.
- As in part 1, the peak time for exits is between “8am - 12pm” and the peak time for entries is between “4pm - 8pm”, which correspond to the times when people come into Manhattan for work and go back home after work.
4.1.2-2 Average by weekday vs weekend & holiday
- We further explore the data to see how subway usage patterns differ between weekdays and weekends.
- For the variable “day”, values from “Monday” to “Friday” were replaced with “Weekday”, provided the date was not a holiday. Holidays, Saturdays and Sundays were replaced with “Weekend”.
- We grouped the data by “Weekday” and “Weekend” to calculate the average number of entries and exits for each interval.
- We faceted the graphs by “Weekday” and “Weekend”.
# GroupBy 1.day & 2.interval --> average entry & exit volume
turnstile$is_holiday <- as.character(turnstile$is_holiday)
data2_2 <- turnstile %>% select(interval, day, is_holiday, entry_volume, exit_volume) %>% group_by(day, is_holiday, interval) %>% summarise(avg_entry = mean(entry_volume), avg_exit = mean(exit_volume))
# Recode "day" as "Weekday" (Monday-Friday, non-holiday) or "Weekend" (Saturday, Sunday or holiday)
data2_2 <- data2_2 %>% ungroup() %>% mutate(day = if_else(!(day %in% c("Saturday","Sunday")) & is_holiday == "False", "Weekday", "Weekend"))
data2_2 <- data2_2 %>% group_by(day, interval) %>% summarise(avg_entry = mean(avg_entry), avg_exit = mean(avg_exit))
data2_2$interval <- factor(data2_2$interval, c("08PM-12AM","04PM-08PM","12PM-04PM","08AM-12PM","04AM-08AM","12AM-04AM"))
ggplot(data2_2, aes(y = avg_entry, x = interval)) +
geom_col(col='#0072B2', fill="#66CC99") + ylab("Entry Count") + xlab("Interval") + facet_wrap(~ day) + coord_flip()
ggplot(data2_2, aes(y = avg_exit, x = interval)) +
geom_col(col='#0072B2', fill='#E69F00') + ylab("Exit Count") + xlab("Interval") + facet_wrap(~ day) + coord_flip()
- The first interesting observation is the difference in the volume of subway usage between weekdays and weekends. On weekdays, the average number of both entries and exits is substantially higher than on weekends. Since there are many travelers in Manhattan throughout the year, we expected some difference in the pattern over the day but not in the overall count. However, the plots show that there is actually a large gap between weekday and weekend subway usage.
- The second observation is that the pattern of subway usage changes from weekdays to weekends. As we observed in the previous parts, the peak times for subway exits and entries correspond to the times when people go to work and come back home. On weekends, this pattern disappears: the peak time for both entries and exits is between “12pm - 4pm”. This is possibly because people are not going to work on weekends, so the pattern created by commuters from outside Manhattan is less visible in the weekend data.
- There is another interesting observation from the exit graph. On weekdays, many people exit the subway between “4am - 8am”, but far fewer do so at that time on weekends. This is possibly because many people go to the office early in the morning on weekdays.
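One detail worth keeping in mind when reading these averages: the code above first averages within each (day, is_holiday, interval) group and then averages those group averages, which weights every group equally regardless of how many rows it contains. The sketch below, on made-up numbers, shows how an average of averages can differ from a row-level average when group sizes are unequal. Whether the difference matters for the real data depends on how unbalanced the groups are, which is not checked here.

```r
library(dplyr)

# Toy data: four Monday rows and a single Tuesday row (invented values)
toy <- data.frame(
  day          = c(rep("Monday", 4), "Tuesday"),
  entry_volume = c(100, 100, 100, 100, 200),
  stringsAsFactors = FALSE
)

# Average of the per-day averages: each day counts equally
mean_of_means <- toy %>%
  group_by(day) %>%
  summarise(avg = mean(entry_volume)) %>%
  summarise(avg = mean(avg)) %>%
  pull(avg)

# Row-level average: each observation counts equally
row_level_mean <- mean(toy$entry_volume)

print(c(mean_of_means = mean_of_means, row_level_mean = row_level_mean))
```

If a row-level average is preferred, one option would be to recode `day` on the raw `turnstile` rows first and then call `summarise()` once.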
4.1.3 Average by station & interval
- For people who are new to Manhattan and dislike crowded places, here are the top 5 most crowded stations to avoid, in terms of entries and exits.
data3 <- turnstile %>% select(station, station_id, entry_volume, exit_volume) %>% mutate(station_unique = paste(station, station_id)) %>% group_by(station_unique) %>% summarise(avg_entry = mean(entry_volume), avg_exit = mean(exit_volume))
data3 %>%
ungroup() %>%
arrange(avg_entry) %>%
mutate(station_unique = reorder(station_unique, avg_entry)) %>% tail(5) %>%
ggplot(aes(y = avg_entry, x = station_unique)) +
geom_col(col='#0072B2', fill="#66CC99") + ylab("Entry Count") + xlab("Station") + coord_flip()
data3 %>%
ungroup() %>%
arrange(avg_exit) %>%
mutate(station_unique = reorder(station_unique, avg_exit)) %>% tail(5) %>%
ggplot(aes(y = avg_exit, x = station_unique)) +
geom_col(col='#0072B2', fill="#E69F00") + ylab("Exit Count") + xlab("Station") + coord_flip()
- The graphs above show that “Grand Central”, “Herald Square”, “Union Square”, “Port Authority” and “Times Square” are the most crowded stations in Manhattan.
- Note that there are two Grand Central stations in the graphs above. This is because, when preprocessing the subway data with longitude and latitude information, we assigned a unique station id to entrances that are one or more blocks apart.
- Even though the counts for Grand Central are split into two parts, it still ranks as the most crowded station.
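If one wanted a ranking per station name rather than per station id, the split entrances could be combined before ranking. The following is a rough sketch under that assumption; the station names and volumes in `toy` are hypothetical, and summing across ids is only one possible way to combine them.

```r
library(dplyr)

# Toy data: "Station A" is split across two ids, like Grand Central above
# (names and volumes are invented for illustration)
toy <- data.frame(
  station      = c("Station A", "Station A", "Station B", "Station C"),
  station_id   = c(1, 2, 3, 4),
  entry_volume = c(500, 450, 800, 300),
  stringsAsFactors = FALSE
)

# Combine all ids that share a station name, then rank by total entries
by_station <- toy %>%
  group_by(station) %>%
  summarise(total_entry = sum(entry_volume)) %>%
  arrange(desc(total_entry))

print(by_station)
```

Ranking by id keeps entrances that are blocks apart separate, which may be more useful to someone choosing an entrance; ranking by name answers the question of which station complex is busiest overall.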
4.2 Static illustration of crime, weather (rainfall) and subway traffic data
Now that we have looked at subway human traffic data on its own, we move on to explore its relationship with crime (those committed in subway stations) and weather (rainfall).
First, we look at which subway stations are the most “dangerous” (!) and which are the safest.
4.2.1 Bar charts of crime numbers against subway stations
Looking at these bar charts of the top 5 stations by crime count, we see that the crime counts by station for the 3 different crime types were closely related. For example, 125 ST (line 4, 5, 6) and 23 ST were in the top 5 for all 3 types of crime. The similarity between misdemeanor and violation was the strongest among the 3 possible pairings of crime type (i.e., felony-misdemeanor, felony-violation, misdemeanor-violation). In a way, this was not surprising because misdemeanors and violations are more similar to each other than to felonies, which are a more serious type of crime.
Out of interest, we also looked at the 10 “safest” stations. The patterns are a little harder to infer than for the most dangerous stations because there are many ties. Some interesting observations: Wall Street station is one of the safest, whether we are looking at felony or misdemeanor, and Columbia on 116th Street is also one of the safest for misdemeanor!
4.2.2 Box plots of crime numbers by time of the day
For overall crime count, the median was rather consistent across the time periods. The number of crimes committed was highest from 1600-2000hr, which coincided with the evening peak period. Based on that, we expected the morning peak (0800-1200hr) to display the next highest crime count, but the data did not support that. Instead, 1200-1600hr showed the second highest crime count based on the median. Also, the spread (as indicated by the length of the box, i.e., the interquartile range) was largest for the time periods with the highest median crime count. The patterns we saw in the overall count were similarly visible for felony and misdemeanor. For violation, the only similarity with the other crime types was that 1600-2000hr was the period with the highest crime count.
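The two quantities read off a box plot here, the median and the length of the box (the interquartile range), can be tabulated directly. This is a minimal sketch on invented counts; the column names `interval` and `crime_count` are assumptions, since the layout of the crime data is not shown in this section.

```r
library(dplyr)

# Toy daily crime counts for two time periods (invented values)
toy <- data.frame(
  interval    = rep(c("1200-1600", "1600-2000"), each = 5),
  crime_count = c(1, 2, 3, 4, 5,  2, 4, 6, 8, 10),
  stringsAsFactors = FALSE
)

# Median (the line inside the box) and IQR (the length of the box) per period
box_stats <- toy %>%
  group_by(interval) %>%
  summarise(
    median_count = median(crime_count),
    iqr_count    = IQR(crime_count)
  )

print(box_stats)
```

Sorting such a table by `median_count` gives the ranking of time periods discussed above without relying on visual inspection.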
4.2.3-1 Scatter plots of Subway Human Traffic against Crime Count (by weekend, weekday; by crime type)
One data point = one day of the year
For the above and subsequent similar scatter plots, each data point on the graph represented a single day (e.g., total human traffic across all stations for that day, total crimes committed across all stations for that day).
We created the above scatter plots to investigate if there was a relationship between human traffic and crime rate at subway stations.
We plotted the charts separately by weekday and weekend to isolate any effects that weekday vs weekend might have on the relationship between human traffic count and crime count.
Overall, there was a positive correlation between human traffic and crime count for both weekday and weekend. A similar pattern was observed for felony and misdemeanor. However, the relationship was less obvious for violation for which the sample sizes were small.
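The visual impression of a positive relationship can be backed by a correlation coefficient computed separately for weekdays and weekends. The sketch below uses invented daily totals; the column names `day_type`, `traffic` and `crime` are assumptions for illustration.

```r
library(dplyr)

# Toy daily totals: one row per day (invented values)
toy <- data.frame(
  day_type = rep(c("Weekday", "Weekend"), each = 5),
  traffic  = c(50, 55, 60, 65, 70,  25, 28, 30, 33, 35),
  crime    = c(20, 24, 23, 28, 30,  10, 12, 11, 15, 14),
  stringsAsFactors = FALSE
)

# Pearson correlation between traffic and crime within each day type
cor_by_type <- toy %>%
  group_by(day_type) %>%
  summarise(r = cor(traffic, crime), n_days = n())

print(cor_by_type)
```

Reporting `n_days` next to `r` helps when a subgroup is small, as with violations, where a coefficient from few points should be read with caution.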
4.2.3-2 Scatter plots of Subway Human Traffic against Rainfall (by weekend, weekday)
One data point = one day of the year
The above scatter plots investigated if there was a relationship between rainfall and human traffic at subway stations.
For weekdays, we saw that human traffic was not much influenced by rainfall, which was not surprising because everyone had to go to work/school regardless of whether or not it was raining. For weekends, we saw a stronger negative relationship between rainfall and human traffic, which made sense because people might cancel their outdoor activities or leisure travelling plans depending on the weather.
We also saw that most of the data points were clustered around the y-axis, because on most days there was no rain. On that note, we have to highlight that the relationship we saw here is very susceptible to outliers, i.e., the regression slope plotted on the graph above may shift significantly if there were another outlier that, for instance, represented a day with higher rainfall and higher traffic.
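The sensitivity to outliers described above can be illustrated with a small worked example. The numbers below are invented: most days have little or no rain, one rainy day has low traffic, and then a single hypothetical day with heavy rain and high traffic is added.

```r
# Toy data: mostly dry days, one very rainy day with low traffic (invented values)
rain    <- c(0, 0, 0, 0, 1, 2, 3, 10)
traffic <- c(100, 102, 98, 100, 99, 97, 96, 80)

# Slope of the least-squares line before adding the outlier
slope_before <- unname(coef(lm(traffic ~ rain))["rain"])

# Add one hypothetical day with heavy rain AND high traffic
rain_new    <- c(rain, 12)
traffic_new <- c(traffic, 120)

# Slope after adding that single point
slope_after <- unname(coef(lm(traffic_new ~ rain_new))["rain_new"])

print(c(before = slope_before, after = slope_after))
```

In this toy case one extra point is enough to turn a negative slope into a positive one, because points far out on the rainfall axis have high leverage when most observations sit at zero.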
4.2.3-3 Scatter plots of Crime Count against Rainfall (by crime type)
One data point = one day of the year
The above scatter plots investigated if there was a relationship between crime rate and rainfall at subway stations.
There did not seem to be a strong relationship between overall crime count and rainfall, but we did notice that on days with heavy rain, crime count was never high, especially when we drilled down to look at misdemeanors and violations. One explanation would be that heavy rainfall might have deterred potential offenders from travelling to the subway stations.
However, we noticed that this was not true for felony. High counts of felony were observed even for days with heavy rain, suggesting that felony was less dependent on the weather.
Previously, we saw that subway human traffic was sensitive to rainfall on weekends but not so much on weekdays. Therefore, we wondered if crime count would likewise be more sensitive to rainfall on weekends. Based on visual inspection, it did not seem like weekend changed the relationship between crime rate and rainfall.
4.2.4 Scatter plots of Crime Count against Subway Human Traffic (by crime type; by weekend, weekday)
One data point = one subway station
In the previous set of scatter plots, each point represented a day, aggregated across all subway stations. For this set of scatter plots, we instead aggregated across time and let each point represent a unique subway station. The focus here was to investigate whether a subway station with higher traffic also suffered from a higher crime count.
We noticed a general trend of higher crime count for stations with higher human traffic, and this was true regardless of weekday or weekend, or crime type. There were two outliers with lower traffic but very high crime count (23 ST on line 6 and 125 ST on line 4, 5 & 6), which meant that for these two stations, the higher crime count could not be well explained by human traffic alone. Another factor could be whether the surrounding neighborhood tended to have a higher crime rate. Lower traffic could also work in the opposite direction: a station that is more isolated may attract more potential offenders, since crimes could more easily be committed unseen.
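One way to make "not well explained by human traffic alone" concrete is to fit a simple line of crime count on traffic and look for stations with large positive residuals. This is only a sketch of that idea on invented data, not the method used to identify the two stations above; station labels "A" to "F" are hypothetical.

```r
# Toy data: one row per station (invented values);
# station "F" has low traffic but a high crime count
stations <- data.frame(
  station = c("A", "B", "C", "D", "E", "F"),
  traffic = c(100, 200, 300, 400, 500, 150),
  crime   = c(10, 20, 30, 40, 50, 80),
  stringsAsFactors = FALSE
)

# Fit crime count as a linear function of traffic
fit <- lm(crime ~ traffic, data = stations)

# Residual = observed crime minus the crime count predicted from traffic
stations$residual <- resid(fit)

# Station whose crime count most exceeds what its traffic would predict
flagged <- stations$station[which.max(stations$residual)]
print(flagged)
```

Because a strong outlier also pulls the fitted line toward itself, a robust fit or a fit that leaves the candidate station out could be considered on real data.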
Lastly, the inverse relationship that we observed earlier between minor crime counts and rainfall may only be a proxy for the other relationships that we have observed (i.e., lower crime count with lower traffic, and lower traffic with higher rainfall).
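The proxy argument can be sketched with a constructed example: if rainfall lowers traffic and crime depends only on traffic, then crime and rainfall are negatively related on their own, but the rainfall effect vanishes once traffic is included in the model. The data below are built by hand to have exactly this structure; they say nothing about what the real data would show.

```r
# Constructed data: rainfall lowers traffic, and crime depends only on traffic
rain    <- c(0, 0, 0, 5, 10, 20, 0, 30)
noise   <- c(3, -2, 1, -1, 2, -3, 0, 0)   # day-to-day variation unrelated to rain
traffic <- 100 - 2 * rain + noise
crime   <- 0.5 * traffic

# Crime against rainfall alone: the slope is negative
marginal_slope <- unname(coef(lm(crime ~ rain))["rain"])

# Crime against rainfall and traffic together
adjusted     <- coef(lm(crime ~ rain + traffic))
rain_coef    <- unname(adjusted["rain"])
traffic_coef <- unname(adjusted["traffic"])

print(c(marginal = marginal_slope, rain = rain_coef, traffic = traffic_coef))
```

Running the second model on the real daily totals would be one way to check whether rainfall still matters for minor crime counts after accounting for traffic.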
4.2.5 Time Series of Crime Count across time; superimposed with Subway Human Traffic, Rainfall
The above set of time series was meant to explore the trend of crime across time and also to see if it varied in the same direction as the other parameters of human traffic and rainfall.
Firstly, we could see large fluctuations in crime count across time. This could not be caused by intra-day patterns because we were already looking at total crime count per day. Therefore, we looked at the crime count across time only for weekdays and then only for weekends, but the fluctuations remained. It was clear that there were other factors influencing crime count that were not reflected in our analysis.
Ignoring the noise, it seemed like crime rate was lower near the start and end of the year with two prominent peaks between Apr and Jun.
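A common way to look past day-to-day noise in a daily series is a centered moving average. The analysis above does not state that any smoothing was applied, so the following is only a sketch of one option, shown on a short invented series with a 3-day window; `stats::filter` is called with its namespace because `dplyr::filter` masks it once the tidyverse is loaded.

```r
# Toy daily series (invented values)
daily_crime <- c(1, 2, 3, 4, 5, 6, 7)

# Window length in days; odd so the window is centered on each day
k <- 3

# Centered moving average; the first and last values are NA because
# the window does not fit entirely inside the series there
smoothed <- as.numeric(stats::filter(daily_crime, rep(1 / k, k), sides = 2))

print(smoothed)
```

On a full year of daily counts, a 7-day window would average out any day-of-week effect, at the cost of blurring short peaks.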
Then, we looked at crime count by crime type. The picture was even less clear, with fluctuations dominating the graph.
In an attempt to isolate the noise, we looked at the crime count for just the top 5 stations (in terms of highest crime rate). The peaks between Apr and Jun were still there but there was now a new higher peak in Nov which was not visible previously in the overall chart. The same peak could be observed when we looked at crime rate by type.
Next, we looked at human traffic and crime count across time and there seemed to be a positive correlation. This matched our earlier observation with scatter plots.
We further tried to look at rainfall with human traffic, and rainfall with crime count, but these two graphs were dominated by large fluctuations in rainfall and were not informative. It was clear to us that a scatter plot offers a better way to visualize the relationship between two variables (e.g., crime and rainfall), especially when we do not believe that time influences both of them in the same way.